This project has focused on the development of active learning and semi-supervised
learning algorithms for underwater sensing. The research is being executed in
collaboration with NSWC Panama City, which serves as a good source of data for algorithm
testing and as a transition point for the algorithms. A long-term goal is to transition the
active-learning technology to the NSWC software suite.

OBJECTIVES 

Context  plays  an  important  role  when  performing  underwater  classification,  and  in 
this  report  we  examine  context  from  two  perspectives.  First,  the  classification  of  items 
within  a  single  task  is  placed  within  the  context  of  distinct  concurrent  or  previous 
classification  tasks  (multiple  distinct  data  collections).  This  is  referred  to  as  multi-task 
learning  (MTL),  and  is  implemented  here  in  a  statistical  manner,  using  a  simplified 
form  of  the  Dirichlet  process.  In  addition,  when  performing  many  classification  tasks 
one  has  simultaneous  access  to  all  unlabeled  data  that  must  be  classified,  and  therefore 
there  is  an  opportunity  to  place  the  classification  of  any  one  feature  vector  within  the 
context  of  all  unlabeled  feature  vectors;  this  is  referred  to  as  semi-supervised  learning. 
In  this  report  we  integrate  MTL  and  semi-supervised  learning  into  a  single  framework, 
thereby  exploiting  two  forms  of  contextual  information. 

A  key  new  objective  of  the  research  is  to  adapt  the  features  to  the  environment.  For 
this  purpose  we  have  introduced  the  Beta  Process,  the  development  and  application  of 
which  have  been  an  important  component  of  the  research  executed  over  the  last  year, 
and  reported  here. 

APPROACH 

The Dirichlet process (DP) [1], [2], [3], denoted $DP(\alpha G_0)$, is a prior used in
nonparametric Bayesian models for the purpose of clustering data. It is a distribution over
probability measures, i.e., each draw $G$ from a Dirichlet process is itself a distribution.
The base measure $G_0$ is the prior mean of the DP, and the concentration parameter $\alpha$,
acting as an inverse variance, controls how much a draw $G$ from the DP is allowed to
deviate from the base measure $G_0$. The larger $\alpha$ is, the smaller the variance, and the
more $G$ concentrates its mass around the mean $G_0$. In the limit $\alpha \to \infty$, $G$ goes
to $G_0$ weakly or pointwise. However, we cannot say that $G$ equals $G_0$ in this limit,
since draws from a DP are discrete distributions with probability one, even if the base
measure $G_0$ is continuous. In the limit $\alpha \to 0$, $G$ collapses to a single point mass
whose location is random.

Let $\{\theta_1, \theta_2, \ldots, \theta_N\}$ be a sequence of independent draws from a prior $G$, with $G$
itself a sample from $DP(\alpha G_0)$. The mathematical representation of the DP model is:

\[
\begin{aligned}
\theta_i &\sim G \\
G &\sim DP(\alpha G_0)
\end{aligned}
\tag{1}
\]

Marginalizing out $G$, the conditional distribution of $\theta_{N+1}$ given the other $N$ observations
$\{\theta_1, \theta_2, \ldots, \theta_N\}$ is

\[
p(\theta_{N+1} \mid \theta_1, \theta_2, \ldots, \theta_N, \alpha, G_0)
= \frac{1}{\alpha + N} \sum_{n=1}^{N} \delta_{\theta_n} + \frac{\alpha}{\alpha + N}\, G_0
\tag{2}
\]

where $\delta_{\theta_n}$ is a point mass located at $\theta_n$.

Note that the draw $G$ from a DP is discrete, and consequently multiple $\theta_i$'s may take
the same value. Let $\theta_k^*$, $k = 1, 2, \ldots, K$, denote the $K$ distinct values among
$\theta_1, \theta_2, \ldots, \theta_N$ and let $n_k$ be the number of repeats of $\theta_k^*$. The conditional
distribution (2) can be equivalently written as:

\[
p(\theta_{N+1} \mid \theta_1, \theta_2, \ldots, \theta_N, \alpha, G_0)
= \sum_{k=1}^{K} \frac{n_k}{\alpha + N}\, \delta_{\theta_k^*} + \frac{\alpha}{\alpha + N}\, G_0
\tag{3}
\]


From (3), we notice that the probability of $\theta_{N+1}$ taking the value $\theta_k^*$ is proportional to
$n_k$. The larger $n_k$ is, the more probable it is that $\theta_{N+1}$ takes the value $\theta_k^*$. This
phenomenon can be called a clustering property, since a new observation tends to join the
group with a larger number of samples.
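
The clustering property of (3) can be illustrated with a short simulation. The sketch below is only an illustration under stated assumptions: the function names, the seed handling and the choice to track only the counts $n_k$ (not the values $\theta_k^*$ drawn from $G_0$) are ours and are not taken from the report.

```python
import random


def sample_next_cluster(counts, alpha, rng):
    """Pick the group joined by the next draw, following eq. (3).

    counts[k] plays the role of n_k. Returning len(counts) means that a
    new value would be drawn from the base measure G0.
    """
    total = sum(counts) + alpha          # alpha + N
    u = rng.random() * total
    acc = 0.0
    for k, n_k in enumerate(counts):
        acc += n_k                       # existing value k has weight n_k
        if u < acc:
            return k
    return len(counts)                   # remaining weight alpha: new value


def simulate_assignments(n_draws, alpha, seed=0):
    """Sequentially draw n_draws assignments and return the group sizes."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_draws):
        k = sample_next_cluster(counts, alpha, rng)
        if k == len(counts):
            counts.append(1)             # open a new group
        else:
            counts[k] += 1               # join an existing group
    return counts
```

Calling `simulate_assignments(200, 1.0)` returns the sizes of the groups formed by 200 sequential draws; larger values of `alpha` tend to produce more groups.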

As discussed above, draws from a DP are discrete distributions. The stick-breaking
construction [4] makes this point explicit by providing a constructive form of a draw
from a DP:

\[
\begin{aligned}
G &= \sum_{k=1}^{\infty} \pi_k\, \delta_{\theta_k^*} \\
\pi_k &= \beta_k \prod_{l=1}^{k-1} (1 - \beta_l) \\
\beta_k &\sim \mathrm{Beta}(1, \alpha) \\
\theta_k^* &\sim G_0
\end{aligned}
\tag{4}
\]

Here $0 \le \pi_k \le 1$ and $\sum_{k=1}^{\infty} \pi_k = 1$. It is clear from the constructive form of $G$ that
draws from a DP are discrete, composed of an infinite weighted sum of point masses.
The construction of $\pi$ can be understood as follows. Starting with a stick of length 1,
we break it at $\beta_1 \sim \mathrm{Beta}(1, \alpha)$ and assign $\pi_1$ to be the length of the piece we just broke
off. Recursively breaking the remaining portion at $\beta_2, \beta_3, \ldots$, we obtain the lengths $\pi_2, \pi_3$
and so forth. Since the lengths $\pi_k$ decrease stochastically with $k$, the summation in
(4) may be truncated to $N$ terms, $G = \sum_{k=1}^{N} \pi_k \delta_{\theta_k^*}$, yielding an $N$-level truncated
approximation to a draw $G$ from the DP [5]. A bound on the error introduced by the
truncation is given in [5].
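
A minimal sketch of the truncated stick-breaking weights follows. One detail is an assumption on our part rather than something stated in the report: the last break is set to one so that the $N$ weights sum exactly to one, which is a common convention for truncated approximations. The function name is hypothetical.

```python
import random


def truncated_stick_breaking(alpha, n_levels, seed=0):
    """Return pi_1..pi_N of an N-level truncation of eq. (4)."""
    rng = random.Random(seed)
    weights = []
    remaining = 1.0                      # length of stick not yet broken off
    for k in range(n_levels):
        # Assumption: the final break takes the whole remainder, so the
        # truncated weights sum to one.
        beta_k = 1.0 if k == n_levels - 1 else rng.betavariate(1.0, alpha)
        weights.append(beta_k * remaining)
        remaining *= 1.0 - beta_k
    return weights
```

For small `alpha` most of the mass falls on the first few weights, which is why a modest truncation level is usually adequate.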

Assume we have a set of data $\{x_1, x_2, \ldots, x_N\}$ with associated hidden parameters
$\{\theta_1, \theta_2, \ldots, \theta_N\}$; each $\theta_n$ is drawn independently and identically from $G$, while each $x_n$
has distribution $F(\theta_n)$. Since $G$ is discrete, multiple $\theta_n$'s may take the same value
$\theta_k^*$, and the data points $x_n$ associated with the same value $\theta_k^*$ belong to one cluster. Such
a model can be viewed as a mixture model with countably infinite components.
Let $z_i$ be a cluster assignment variable, which takes value $k$ with probability $\pi_k$.
The generative infinite mixture model can be expressed as

\[
\begin{aligned}
x_i \mid z_i, \{\theta_k^*\} &\sim F(\theta_{z_i}^*) \\
z_i \mid \pi &\sim \mathrm{Mult}(\pi) \\
\pi \mid \alpha &\sim \mathrm{stick}(\alpha) \\
\theta_k^* &\sim G_0
\end{aligned}
\tag{5}
\]

where $\mathrm{stick}(\alpha)$ is the stick-breaking process with parameter $\alpha$. Unlike the finite
mixture model, which uses a fixed number of clusters to model the data, the number
of distinct values of $\theta_n$ under a DP prior is driven by the data as well as by the concentration
parameter $\alpha$. In our work, instead of fixing the value of $\alpha$, a Gamma hyper-prior
over $\alpha$ is employed, which yields greater model flexibility.


The beta process (BP) was first introduced by Hjort [6] for applications in survival
analysis. A beta process $BP(cB_0)$ is an independent-increments (Levy) process with
concentration parameter $c$ and base measure $B_0$. Let $\Omega$ be a measurable space and $\mathcal{B}$ its
$\sigma$-algebra. For all disjoint, infinitesimal partitions $B_1, B_2, \ldots, B_K$ of $\Omega$, the beta process
is generated as

\[
H(B_k) \sim \mathrm{Beta}\big(cB_0(B_k),\; c(1 - B_0(B_k))\big)
\tag{6}
\]

A draw from a beta process can be constructed as follows:

\[
\begin{aligned}
B &= \sum_{k} \pi_k\, \delta_{\omega_k} \\
\omega_k &\sim B_0 \\
\pi_k &\sim \mathrm{Beta}\big(cB_0(\omega_k),\; c(1 - B_0(\omega_k))\big)
\end{aligned}
\tag{7}
\]

where $\delta_{\omega_k}$ is a unit mass concentrated at $\omega_k$ ($\omega_k \in \Omega$). Note from (7) that $\sum_{k} \pi_k \neq 1$;
therefore $B$ cannot be treated as a probability mass function.

The two-parameter beta process $BP(cB_0)$ can be extended to a three-parameter beta
process $BP(a, b, B_0)$ [7], which is specified as

\[
H(B_k) \sim \mathrm{Beta}\big(aB_0(B_k),\; b(1 - B_0(B_k))\big)
\tag{8}
\]

Another interesting process closely related to the beta process is the Bernoulli process [8],
denoted $X \sim BeP(B)$, where $B$ is a measure on $\Omega$. If $B$ is continuous, $X$ is a
Poisson process with intensity $B$ and can be constructed as

\[
X = \sum_{i=1}^{N} \delta_{\omega_i}
\tag{9}
\]

where $N \sim \mathrm{Poi}(B(\Omega))$ and the $\omega_i$ are independent draws from the distribution $B/B(\Omega)$. In
the case for which $B$ is discrete, with $B = \sum_k \pi_k \delta_{\omega_k}$, $X$ has the following constructive
form:

\[
\begin{aligned}
X &= \sum_{k} z_k\, \delta_{\omega_k} \\
z_k &\sim \mathrm{Bernoulli}(\pi_k)
\end{aligned}
\tag{10}
\]

$X$ is then a set of elements that take only the values $\{0, 1\}$ at the different locations $\omega_k$.
For our application, $\Omega$ can be thought of as a space of potential features and $X$ as an
observation (or a datum) that possesses a subset of the features. Which features are
possessed differs from datum to datum and is determined by $z_k$.


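Let $\{X_1, X_2, \ldots, X_n\}$ be $n$ independent draws from $BeP(B)$, where $B$ is discrete and
is itself a draw from $BP(cB_0)$. Marginalizing out $B$, the predictive conditional distribution
of a new draw $X_{n+1}$ given the previous draws $\{X_1, X_2, \ldots, X_n\}$ is

\[
\begin{aligned}
X_{n+1} \mid X_1, X_2, \ldots, X_n
&\sim BeP\Big( \frac{c}{c + n} B_0 + \frac{1}{c + n} \sum_{i=1}^{n} X_i \Big) \\
&= BeP\Big( \frac{c}{c + n} B_0 + \sum_{k=1}^{K} \frac{m_{n,k}}{c + n}\, \delta_{\omega_k} \Big)
\end{aligned}
\tag{11}
\]

where $m_{n,k}$ is the number of the $n$ data that possess feature $\omega_k$.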


Let $c = 1$, $\gamma = B_0(\Omega)$ and let $\{X_1, X_2, \ldots, X_{n+1}\}$ be generated in sequence. This
generating process then reduces to the Indian buffet process $IBP(\gamma, n)$ [9]:

• For the first observation (customer) $X_1$, the number of features possessed (or the
number of dishes the customer tastes) is $\mathrm{Poi}(B_0(\Omega))$, i.e., $\mathrm{Poi}(\gamma)$;

• For subsequent observations (customers) $X_{i+1}$, $i = 1, 2, \ldots$, the probability of
selecting a previous feature (or old dish) $\omega_k$ is $\frac{m_{i,k}}{i+1}$, where $m_{i,k}$ is the number of
the previous $i$ observations (customers) that selected feature (dish) $\omega_k$; the number of new
features (or dishes) $X_{i+1}$ will select is $\mathrm{Poi}(B_0(\Omega)/(i + 1)) = \mathrm{Poi}(\gamma/(i + 1))$.
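
The two steps above can be sketched as a small sampler. This is an illustrative sketch only: the function names are ours, and the Poisson draw uses a simple multiplication method that is adequate for the small rates involved but is not part of the report.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw from Poi(lam) by multiplying uniforms (fine for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def indian_buffet(n_customers, gamma, seed=0):
    """Return (rows, dish_counts); rows[i] lists the dishes of customer i+1."""
    rng = random.Random(seed)
    dish_counts = []                     # m_k: customers who chose dish k
    rows = []
    for i in range(n_customers):         # i customers have been served so far
        # Old dish k is selected with probability m_k / (i + 1).
        chosen = [k for k, m_k in enumerate(dish_counts)
                  if rng.random() < m_k / (i + 1)]
        # Number of new dishes is Poi(gamma / (i + 1)); Poi(gamma) when i = 0.
        n_new = sample_poisson(gamma / (i + 1), rng)
        new_dishes = list(range(len(dish_counts), len(dish_counts) + n_new))
        for k in chosen:
            dish_counts[k] += 1
        dish_counts.extend([1] * n_new)
        rows.append(chosen + new_dishes)
    return rows, dish_counts
```

Each row of the result corresponds to one binary feature vector $X_i$, stored as the list of indices at which it equals one.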

As mentioned in [9], the Indian buffet process is the limiting case of a finite feature
model as $K$, the number of potential features, tends to infinity. The finite feature model
provides full conjugacy and allows variational inference to be performed on
the multi-task feature selection. The finite latent feature model may be defined as

\[
\begin{aligned}
X &= \sum_{k=1}^{K} z_k\, \delta_{\omega_k} \\
z_k &\sim \mathrm{Bernoulli}(\pi_k) \\
\pi_k &\sim \mathrm{Beta}\big(\tfrac{a}{K},\, 1\big)
\end{aligned}
\tag{12}
\]

Here we assume that the features are independent of one another and may be selected
by each datum with the same probability, i.e., $B_0(B_k) = 1/K$ for the $k = 1, 2, \ldots, K$ regions.
Extending the one-parameter beta distribution $\mathrm{Beta}(\tfrac{a}{K}, 1)$ to the two-parameter beta
distribution $\mathrm{Beta}(\tfrac{a}{K}, \tfrac{b(K-1)}{K})$, we may obtain a more flexible model for our data.

Let $p(y_i \mid \mathcal{N}_t(x_i), \theta)$ denote a neighborhood-based classifier parameterized by $\theta$,
representing the probability of class label $y_i$ for $x_i$, given the neighborhood of $x_i$ [10].
The semi-supervised classifier is defined as a mixture

\[
p(y_i \mid \mathcal{N}_t(x_i), \theta) = \sum_{j=1}^{n} b_{ij}\, p^*(y_i \mid x_j, \theta)
\tag{13}
\]

Here $p^*(y_i \mid x_j, \theta)$ is a base classifier parameterized by $\theta$, which gives the probability
of class label $y_i$ at data point $x_j$. The base classifier can be implemented by any
parameterized probabilistic classifier. For binary classification with $y \in \{1, 0\}$, the base
classifier can be chosen as a probit regression

\[
p^*(y_i = 1 \mid x_j, \theta) = p(z_{ij} \ge 0 \mid x_j, \theta)
= \int_{0}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\Big( -\frac{\| z_{ij} - \theta^T x_j \|^2}{2} \Big)\, dz_{ij}
\tag{14}
\]

where a constant element 1 is assumed to be prefixed to each $x$ (the prefixed $x$ is still
denoted as $x$ for notational simplicity), and thus the first element in $\theta$ is a bias term.
$b_{ij}$ represents the probability of walking from $x_i$ to $x_j$ in $t$ steps [10].
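
The mixture in (13) with the probit base classifier of (14) can be sketched as follows. The sketch assumes that a row-stochastic one-step transition matrix is already available (how it is built from the data is described in [10] and is not reproduced here); the function names are hypothetical.

```python
import math


def probit(u):
    """Standard normal CDF, i.e. the integral in eq. (14) with mean u."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))


def mat_mul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def t_step_walk(one_step, t):
    """t-step transition probabilities b_ij from a one-step matrix."""
    n = len(one_step)
    walk = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(t):
        walk = mat_mul(walk, one_step)
    return walk


def neighborhood_prob(i, walk, xs, theta):
    """p(y_i = 1 | N_t(x_i), theta): the b_ij-weighted mixture of eq. (13).

    Each x in xs is assumed to already carry the prefixed constant 1.
    """
    return sum(b_ij * probit(sum(w * v for w, v in zip(theta, x_j)))
               for b_ij, x_j in zip(walk[i], xs))
```

Because each row of the walk matrix sums to one, the result is a convex combination of the base-classifier outputs at the neighbors of $x_i$.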

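Let $\mathcal{L} \subseteq \{1, 2, \ldots, n\}$ denote the index set of labeled data in $X$. The neighborhood-
conditioned likelihood function is written as

\[
p(\{y_i, i \in \mathcal{L}\} \mid \{\mathcal{N}_t(x_i) : i \in \mathcal{L}\}, \theta)
= \prod_{i \in \mathcal{L}} p(y_i \mid \mathcal{N}_t(x_i), \theta)
= \prod_{i \in \mathcal{L}} \sum_{j=1}^{n} b_{ij}\, p^*(y_i \mid x_j, \theta)
\tag{15}
\]

Suppose we are given $M$ tasks, defined by $M$ partially labeled data sets

\[
\mathcal{D}_m = \{x_i^m : i = 1, 2, \ldots, n_m\} \cup \{y_i^m : i \in \mathcal{L}_m\}
\]

for $m = 1, \ldots, M$, where $y_i^m$ is the class label of $x_i^m$ and $\mathcal{L}_m \subseteq \{1, 2, \ldots, n_m\}$ is
the index set of labeled data in task $m$. We consider $M$ semi-supervised classifiers,
parameterized by $\theta_m$, $m = 1, \ldots, M$, with $\theta_m$ responsible for task $m$.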


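Assume that, given $\{\theta_1, \ldots, \theta_M\}$, the class labels of different tasks are conditionally
independent. Substituting (14) into (15), the joint likelihood function over all tasks
can be written as

\[
\begin{aligned}
& p\big(\{y_i^m, i \in \mathcal{L}_m\}_{m=1}^{M} \mid \{\mathcal{N}_t(x_i^m) : i \in \mathcal{L}_m\}_{m=1}^{M}, \{\theta_m\}_{m=1}^{M}\big) \\
&\quad = \prod_{m=1}^{M} \prod_{i \in \mathcal{L}_m} \sum_{j=1}^{n_m} b_{ij}^m\, p^*(y_i^m \mid x_j^m, \theta_m) \\
&\quad = \prod_{m=1}^{M} \prod_{i \in \mathcal{L}_m} \sum_{j=1}^{n_m} b_{ij}^m
\int_{\mathcal{Z}(y_i^m)} \frac{1}{\sqrt{2\pi}} \exp\Big( -\frac{\| z_{ij}^m - \theta_m^T x_j^m \|^2}{2} \Big)\, dz_{ij}^m
\end{aligned}
\tag{16}
\]

where $\mathcal{Z}(1) = [0, \infty)$ and $\mathcal{Z}(0) = (-\infty, 0)$, consistent with (14), and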


where  the  m-th  term  in  the  product  is  taken  from  (15),  with  the  superscript  m  indicating 
the  task  index.  Note  that  the  neighborhoods  are  built  for  each  task  independently  of 
other  tasks,  thus  a  random  walk  is  always  restricted  to  the  same  task  (the  one  where 
the  starting  data  point  belongs)  and  can  never  traverse  multiple  tasks. 


Our objective is to learn $\{\theta_1, \ldots, \theta_M\}$ jointly, sharing information between tasks as
appropriate; at the same time, we also want the classifier within each cluster (a group of
similar tasks) to automatically exclude irrelevant features and focus on useful
ones.


Since $\sum_{j=1}^{n_m} b_{ij}^m = 1$, the sum $\sum_{j=1}^{n_m} b_{ij}^m\, p^*(y_i^m \mid x_j^m, \theta_m)$ in (16) can be treated as a mixture
model, and we can remove the summation by introducing a hidden variable $t_i^m$, the
index of the datum to which $x_i^m$ may transit. The generative model for semi-supervised
multi-task feature learning with the probit model can be written as

\[
\begin{aligned}
(y_i^m \mid z_{i t_i^m}^m) &\sim I(z_{i t_i^m}^m \ge 0)\, \delta_1 + I(z_{i t_i^m}^m < 0)\, \delta_0 \\
(z_{i t_i^m}^m \mid t_i^m, x_{t_i^m}^m, \theta_m) &\sim \mathcal{N}(\theta_m^T x_{t_i^m}^m,\, 1) \\
\theta_m &= \Theta_m \circ W_m \\
\Theta_m &\sim G \\
G &\sim DP(\alpha_0 G_0) \\
G_0 &\sim BeP(B) \\
B &\sim BP(a, b, B_0) \\
W_m &\sim \mathcal{N}(0, \beta_0 I)
\end{aligned}
\tag{17}
\]

where $m = 1, 2, \ldots, M$, $i \in \mathcal{L}_m$, and the symbol $\circ$ represents the Hadamard,
or elementwise, product of two vectors. In this generative model, we decompose the
classifier $\theta_m$ into two parts, $\Theta_m$ and $W_m$. A Dirichlet process prior $DP(\alpha_0 G_0)$ is imposed
over $\Theta_m$, which gives $\Theta_m$, $m = 1, 2, \ldots, M$, the clustering property discussed
above. The base measure $G_0$ is a Bernoulli process whose parameter $B$ is a draw from the
beta process $BP(a, b, B_0)$; a draw from $G_0$ is therefore a set of elements taking values in $\{0, 1\}$.
From the above background discussion, the proposed model has the following properties:

(1) Similar tasks group together to share one $\Theta_h^*$. Instead of using a pre-fixed number of
clusters, the number of clusters is inferred from the data itself.

(2) Since $G_0$ is built on a draw from a beta process, $\Theta_h^*$ may be treated as a controller
for feature learning, selecting relevant features and excluding irrelevant features. Data
from tasks within one cluster have the same feature selection mechanism, i.e., the same
features are kept or discarded across all tasks within one cluster. Tasks from different
clusters have different feature selection mechanisms.

(3) The introduction of the parameters $W_m$ gives the model more flexibility: it allows
different weights on the selected features for different tasks within one cluster.


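Employing the stick-breaking construction for the DP and truncating it to level $N$, as well
as approximating the beta process with the finite latent feature model, the proposed model
(17) can be written as

\[
\begin{aligned}
(y_i^m \mid z_{i t_i^m}^m) &\sim I(z_{i t_i^m}^m \ge 0)\, \delta_1 + I(z_{i t_i^m}^m < 0)\, \delta_0 \\
(z_{i t_i^m}^m \mid t_i^m, x_{t_i^m}^m, \eta_m = h, \Theta, W_m) &\sim \mathcal{N}\big((\Theta_h \circ W_m)^T x_{t_i^m}^m,\, 1\big) \\
(\eta_m \mid V) &\sim \sum_{h=1}^{N} \pi_h\, \delta_h, \qquad \pi_h = V_h \prod_{l<h} (1 - V_l), \quad h = 1, 2, \ldots, N \\
(V_h \mid \alpha_0) &\sim \mathrm{Beta}(1, \alpha_0) \\
(\alpha_0 \mid \tau_{10}, \tau_{20}) &\sim \mathrm{Ga}(\tau_{10}, \tau_{20}) \\
(\Theta_{hd} \mid \tau_d) &\sim \mathrm{Bernoulli}(\tau_d) \\
(\tau_d \mid a, b) &\sim \mathrm{Beta}(a, b) \\
(W_m \mid \beta_0) &\sim \mathcal{N}(0, \beta_0 I)
\end{aligned}
\tag{18}
\]

Here we impose a Gamma hyper-prior $\mathrm{Ga}(\tau_{10}, \tau_{20})$ over the concentration parameter $\alpha_0$ of
the DP prior, and $W_m$ is drawn independently from a Gaussian distribution with mean
zero and covariance matrix $\beta_0 I$. The full likelihood function of the model is

\[
\begin{aligned}
& p(y, t, z, \eta, \Theta, W, V, \tau, \alpha_0 \mid x, \tau_{10}, \tau_{20}, a, b, \beta_0) \\
&\quad = \prod_{m=1}^{M} \Big( \prod_{i \in \mathcal{L}_m} p(y_i^m \mid t_i^m, z_{i t_i^m}^m)\, p(z_{i t_i^m}^m \mid t_i^m, \eta_m, \Theta, W_m) \Big)\, p(\eta_m \mid V)\, p(W_m \mid \beta_0) \\
&\qquad \times \prod_{h=1}^{N} p(V_h \mid \alpha_0) \Big( \prod_{d=1}^{D} p(\Theta_{hd} \mid \tau_d) \Big)
\prod_{d=1}^{D} p(\tau_d \mid a, b)\; p(\alpha_0 \mid \tau_{10}, \tau_{20})
\end{aligned}
\tag{19}
\]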


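The sequential update equations of the Gibbs sampler are as follows.

• Draw $z_{ij}^m$ from the truncated normal distribution $\mathcal{TN}(U_j^m, 1)$, where
$U_j^m = (\Theta_{\eta_m} \circ W_m)^T x_j^m$ and the truncation is to $z_{ij}^m \ge 0$ when $y_i^m = 1$ and to
$z_{ij}^m < 0$ when $y_i^m = 0$;

• Draw $\eta_m$ from a multinomial distribution with parameter $\pi_m = (\pi_{m1}, \ldots, \pi_{mN})$, where

\[
\tilde{\pi}_{mh} \propto
\exp\Big\{ \sum_{i \in \mathcal{L}_m} \sum_{j=1}^{n_m} \delta_{ij}^m \Big( z_{ij}^m W_h^T x_j^m - \tfrac{1}{2} (x_j^m)^T (W_h W_h^T)\, x_j^m \Big) \Big\}
\times \exp\Big\{ \ln V_h + \sum_{l<h} \ln(1 - V_l) \Big\}
\tag{20}
\]

with $W_h = \Theta_h \circ W_m$ and $\pi_{mh} = \tilde{\pi}_{mh} / \sum_{k=1}^{N} \tilde{\pi}_{mk}$;

• Draw $W_m$ from a Gaussian distribution with mean
$\mu_{W_m} = \Sigma_{W_m} \sum_{i \in \mathcal{L}_m} \sum_{j=1}^{n_m} \delta_{ij}^m z_{ij}^m (\tilde{\Theta}_{\eta(m)} \circ x_j^m)$
and covariance
$\Sigma_{W_m} = \big( \sum_{i \in \mathcal{L}_m} \sum_{j=1}^{n_m} \delta_{ij}^m (\tilde{\Theta}_{\eta(m)} \circ x_j^m)(\tilde{\Theta}_{\eta(m)} \circ x_j^m)^T + \tfrac{1}{\beta_0} I \big)^{-1}$,
where $\tilde{\Theta}_h = [\Theta_h, \Theta_h, \ldots, \Theta_h]$;

• Draw $V_h$ from the beta distribution $\mathrm{Beta}(v_{h1}, v_{h2})$ with $v_{h1} = 1 + \sum_{m=1}^{M} 1(\eta_m = h)$
and $v_{h2} = \alpha_0 + \sum_{m=1}^{M} \sum_{l>h} 1(\eta_m = l)$;

• Draw $\alpha_0$ from the gamma distribution $\mathrm{Gamma}(\tau_1, \tau_2)$ with $\tau_1 = N - 1 + \tau_{10}$ and
$\tau_2 = \tau_{20} - \sum_{h=1}^{N-1} \ln(1 - V_h)$;

• Draw $\Theta_{hd}$ from a Bernoulli distribution with parameter $p = \frac{1}{1 + \exp(-tmp)}$, where

\[
\begin{aligned}
tmp = {} & \sum_{m:\, \eta_m = h} \sum_{i \in \mathcal{L}_m} \sum_{j=1}^{n_m} \delta_{ij}^m
\Big( z_{ij}^m W_{md}\, x_{jd}^m - \tfrac{1}{2} W_{md}^2 (x_{jd}^m)^2
- (\Theta_h \circ W_m)^T x_j^m\, W_{md}\, x_{jd}^m + W_{md}^2\, \Theta_{hd}\, (x_{jd}^m)^2 \Big) \\
& + \ln(\tau_d) - \ln(1 - \tau_d);
\end{aligned}
\]

• Draw $\tau_d$ from the beta distribution $\mathrm{Beta}(\tau_{d1}, \tau_{d2})$ with $\tau_{d1} = a + \sum_{h=1}^{N} 1(\Theta_{hd} = 1)$
and $\tau_{d2} = b + \sum_{h=1}^{N} 1(\Theta_{hd} = 0)$.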
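
Three of the conjugate draws in the sampler (the latent probit variable, the stick variables and the concentration parameter) can be sketched with the standard library alone. This is an illustrative sketch, not the report's implementation: the function names are ours, cluster indices are zero-based, the rate parameterization of the gamma distribution is an assumption, and the inverse-CDF truncated-normal draw is only reliable for moderate values of the mean.

```python
import math
import random
from statistics import NormalDist


def draw_truncated_z(mean, label, rng):
    """Draw z ~ N(mean, 1) restricted to z >= 0 (label 1) or z < 0 (label 0)."""
    std = NormalDist()
    p_neg = std.cdf(-mean)                       # P(z < 0)
    lo, hi = (p_neg, 1.0) if label == 1 else (0.0, p_neg)
    u = lo + (hi - lo) * rng.random()
    u = min(max(u, 1e-12), 1.0 - 1e-12)          # keep inv_cdf input in (0, 1)
    return mean + std.inv_cdf(u)


def draw_sticks(eta, n_levels, alpha0, rng):
    """V_h ~ Beta(1 + #{eta_m = h}, alpha0 + #{eta_m > h}), h = 0..n_levels-1."""
    sticks = []
    for h in range(n_levels):
        n_eq = sum(1 for e in eta if e == h)
        n_gt = sum(1 for e in eta if e > h)
        sticks.append(rng.betavariate(1.0 + n_eq, alpha0 + n_gt))
    return sticks


def draw_alpha0(sticks, tau10, tau20, rng):
    """alpha0 ~ Gamma(N - 1 + tau10, rate tau20 - sum_{h<N} ln(1 - V_h))."""
    shape = len(sticks) - 1 + tau10
    rate = tau20 - sum(math.log(1.0 - v) for v in sticks[:-1])
    return rng.gammavariate(shape, 1.0 / rate)   # gammavariate takes a scale
```

In a full sampler these draws would be interleaved with the updates of $\eta_m$, $W_m$, $\Theta_{hd}$ and $\tau_d$ listed above.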

In this subsection we derive the variational Bayesian approximation of the exact
posterior distribution.

For simplicity, the collection of all available data, including features and labels, is
denoted as $\mathcal{D}$, the collection of all hidden variables and model parameters as $\Phi$, and the
collection of hyper-parameters as $\Psi$. In our model $\mathcal{D} = \{y, x\}$, $\Phi = \{t, z, \eta, \Theta, W, V, \tau, \alpha_0\}$
and $\Psi = \{\tau_{10}, \tau_{20}, a, b, \beta_0\}$. By Bayes' law, the joint posterior distribution of the parameters
$\Phi$ given the observed data $\mathcal{D}$ and hyper-parameters $\Psi$ is

\[
p(\Phi \mid \mathcal{D}, \Psi) = \frac{p(\mathcal{D} \mid \Phi)\, p(\Phi \mid \Psi)}{p(\mathcal{D} \mid \Psi)}
\tag{21}
\]

where $p(\mathcal{D} \mid \Psi) = \int p(\mathcal{D} \mid \Phi)\, p(\Phi \mid \Psi)\, d\Phi$ often involves high-dimensional, complicated
integrals. The variational Bayesian approach [11], [12], [13] approximates the joint
posterior $p(\Phi \mid \mathcal{D}, \Psi)$ with a variational distribution $q(\Phi)$. The log of the marginal likelihood
is written as

\[
\ln p(\mathcal{D} \mid \Psi) = \ln \int p(\mathcal{D} \mid \Phi)\, p(\Phi \mid \Psi)\, d\Phi
= \ln \int q(\Phi)\, \frac{p(\mathcal{D} \mid \Phi)\, p(\Phi \mid \Psi)}{q(\Phi)}\, d\Phi
\ge \int q(\Phi) \ln \frac{p(\mathcal{D} \mid \Phi)\, p(\Phi \mid \Psi)}{q(\Phi)}\, d\Phi
= \mathcal{L}(\mathcal{D} \mid \Psi)
\tag{22}
\]

where $\mathcal{L}(\mathcal{D} \mid \Psi)$ is the lower bound of $\ln p(\mathcal{D} \mid \Psi)$. The problem of computing the posterior
can be reformulated as an optimization problem of minimizing the Kullback-Leibler
distance between $q(\Phi)$ and $p(\Phi \mid \mathcal{D}, \Psi)$, which is equivalent to maximizing the lower
bound $\mathcal{L}(\mathcal{D} \mid \Psi)$. The optimization problem can be solved analytically based on two
assumptions on $q(\Phi)$: (i) $q(\Phi)$ is factorized; (ii) the integral of $q(\Phi)$ over $\Phi$ is
equal to one. With an appropriate choice of the form of the prior, the variational distributions
of the parameters $q(\Phi)$ are as follows.
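
The relation between the lower bound, the log marginal likelihood and the Kullback-Leibler distance can be checked numerically on a toy problem. The sketch below assumes a single discrete hidden variable with three states (our simplification, not the report's model); `joint[k]` stands for $p(\mathcal{D} \mid \Phi = k)\, p(\Phi = k \mid \Psi)$, and the function names are hypothetical.

```python
import math


def log_evidence(joint):
    """ln p(D | Psi) for a discrete hidden variable."""
    return math.log(sum(joint))


def lower_bound(q, joint):
    """L = sum_k q_k ln(joint_k / q_k), the bound in eq. (22)."""
    return sum(q_k * math.log(p_k / q_k)
               for q_k, p_k in zip(q, joint) if q_k > 0)


def kl_to_posterior(q, joint):
    """KL(q || p(Phi | D, Psi)) with the exact posterior joint / evidence."""
    evidence = sum(joint)
    return sum(q_k * math.log(q_k / (p_k / evidence))
               for q_k, p_k in zip(q, joint) if q_k > 0)
```

For any valid `q`, `lower_bound(q, joint) + kl_to_posterior(q, joint)` equals `log_evidence(joint)`, so raising the bound is the same as shrinking the KL distance, and the bound is tight when `q` is the exact posterior.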

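• Update $q(t_i^m = j, z_{ij}^m)$ for $m = 1, 2, \ldots, M$, $i \in \mathcal{L}_m$, $j = 1, 2, \ldots, n_m$:

\[
q(t_i^m = j, z_{ij}^m) \propto \mathcal{TN}(U_j^m, 1, y_i^m)
\tag{23}
\]

where $U_j^m = \sum_{h=1}^{N} \rho_{mh} \big( \langle \Theta_h \rangle \circ \langle W_m \rangle \big)^T x_j^m$ and $\mathcal{TN}(U, 1, y)$ denotes a unit-variance
normal distribution with mean $U$, truncated to the half-line consistent with the label $y$;

• Update $q(t_i^m = j) = \dfrac{p(y_i^m \mid x_j^m, U_j^m)\, b_{ij}^m}{\sum_{k=1}^{n_m} p(y_i^m \mid x_k^m, U_k^m)\, b_{ik}^m}$;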

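• Update $q(\eta_m = h) = \rho_{mh}$, where

\[
\tilde{\rho}_{mh} \propto
\exp\Big\{ \sum_{i \in \mathcal{L}_m} \sum_{j=1}^{n_m} \delta_{ij}^m \Big( \langle z_{ij}^m \rangle \langle W_h \rangle^T x_j^m - \tfrac{1}{2} (x_j^m)^T \langle W_h W_h^T \rangle\, x_j^m \Big) \Big\}
\times \exp\Big\{ \langle \ln V_h \rangle + \sum_{l<h} \langle \ln(1 - V_l) \rangle \Big\}
\tag{24}
\]

with $W_h = \Theta_h \circ W_m$, $\langle W_h \rangle = \langle \Theta_h \rangle \circ \langle W_m \rangle$ and $\langle W_h W_h^T \rangle = \langle \Theta_h \Theta_h^T \rangle \circ \langle W_m W_m^T \rangle$.
After normalizing, we obtain $\rho_{mh} = \tilde{\rho}_{mh} / \sum_{k=1}^{N} \tilde{\rho}_{mk}$;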

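• Update $q(W_m)$; the variational posterior of $W_m$ can be shown to be normal
with covariance
$\Sigma_{W_m} = \big( \sum_{i \in \mathcal{L}_m} \sum_{j=1}^{n_m} \delta_{ij}^m \sum_{h=1}^{N} \rho_{mh} \big( \langle \tilde{\Theta}_h \tilde{\Theta}_h^T \rangle \circ (x_j^m x_j^{mT}) \big) + \tfrac{1}{\beta_0} I \big)^{-1}$
and mean
$\mu_{W_m} = \Sigma_{W_m} \sum_{i \in \mathcal{L}_m} \sum_{j=1}^{n_m} \sum_{h=1}^{N} \rho_{mh}\, \delta_{ij}^m \langle z_{ij}^m \rangle \big( \langle \tilde{\Theta}_h \rangle \circ x_j^m \big)$,
where $\tilde{\Theta}_h = [\Theta_h, \Theta_h, \ldots, \Theta_h]$ and, under the factorized $q$,

\[
\langle \tilde{\Theta}_h \tilde{\Theta}_h^T \rangle =
\begin{bmatrix}
\langle \Theta_{h1}^2 \rangle & \langle \Theta_{h1} \rangle \langle \Theta_{h2} \rangle & \cdots & \langle \Theta_{h1} \rangle \langle \Theta_{hD} \rangle \\
\langle \Theta_{h2} \rangle \langle \Theta_{h1} \rangle & \langle \Theta_{h2}^2 \rangle & \cdots & \langle \Theta_{h2} \rangle \langle \Theta_{hD} \rangle \\
\vdots & \vdots & \ddots & \vdots \\
\langle \Theta_{hD} \rangle \langle \Theta_{h1} \rangle & \langle \Theta_{hD} \rangle \langle \Theta_{h2} \rangle & \cdots & \langle \Theta_{hD}^2 \rangle
\end{bmatrix};
\]

• Update $q(V_h)$; the variational posterior of $V_h$ is also a beta distribution $\mathrm{Beta}(v_{h1}, v_{h2})$
with $v_{h1} = 1 + \sum_{m=1}^{M} \rho_{mh}$ and $v_{h2} = \langle \alpha_0 \rangle + \sum_{m=1}^{M} \sum_{l>h} \rho_{ml}$;

• Update $q(\alpha_0)$; the variational posterior of $\alpha_0$ can be shown to be a gamma
distribution $\mathrm{Gamma}(\tau_1, \tau_2)$ with $\tau_1 = N - 1 + \tau_{10}$ and $\tau_2 = \tau_{20} - \sum_{h=1}^{N-1} \langle \ln(1 - V_h) \rangle$;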
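• Update $q(\Theta_{hd})$; the probability of $\Theta_{hd}$ being equal to 1 is proportional to

\[
\begin{aligned}
q(\Theta_{hd} = 1) \propto {} &
\exp\Big\{ \sum_{m=1}^{M} \sum_{i \in \mathcal{L}_m} \sum_{j=1}^{n_m} \rho_{mh}\, \delta_{ij}^m
\Big( \langle z_{ij}^m \rangle \langle W_{md} \rangle x_{jd}^m - \tfrac{1}{2} \langle W_{md}^2 \rangle (x_{jd}^m)^2 \Big) \Big\} \\
& \times \exp\Big\{ \sum_{m=1}^{M} \sum_{i \in \mathcal{L}_m} \sum_{j=1}^{n_m} \rho_{mh}\, \delta_{ij}^m
\Big( \langle W_{md}^2 \rangle \langle \Theta_{hd} \rangle (x_{jd}^m)^2 - \langle \Theta_h \circ W_m \rangle^T x_j^m \langle W_{md} \rangle x_{jd}^m \Big) \Big\} \\
& \times \exp\{ \langle \ln(\tau_d) \rangle \}
\end{aligned}
\tag{25}
\]

and the probability of $\Theta_{hd}$ being equal to 0 is proportional to $\exp\{ \langle \ln(1 - \tau_d) \rangle \}$;

• Update $q(\tau_d)$, $d = 1, 2, \ldots, D$; the variational posterior of $\tau_d$ is still a beta
distribution $\mathrm{Beta}(\tau_{d1}, \tau_{d2})$ with $\tau_{d1} = \sum_{h=1}^{N} q(\Theta_{hd} = 1) + a$ and
$\tau_{d2} = \sum_{h=1}^{N} q(\Theta_{hd} = 0) + b$.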

We also discuss here how active learning may be incorporated into this framework.
We take an information-theoretic approach to identifying the data locations at which the
labels would be most informative about the classifier parameters. Our approach is based on
the Fisher information [14], [15], and is related to previous uses of active learning
[16], [17] as applied to purely supervised models. The Fisher information involves the
log-likelihood; as a result, the prior is excluded from the calculation. Since the tasks
are connected only through the prior, this implies that the Fisher information can
be calculated for each task separately (although not independently, since the
true parameters are replaced by their most recent estimates, as seen below, which are
coupled by the prior). We therefore drop each variable's dependence on the task index
$m$, for notational simplicity. The data log-likelihood is obtained by taking the logarithm
of (15),


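\[
\ell(\theta) \stackrel{\mathrm{def}}{=} \ln p(\{y_i, i \in \mathcal{L}\} \mid \{\mathcal{N}_t(x_i) : i \in \mathcal{L}\}, \theta)
= \sum_{i \in \mathcal{L}} \ln \sum_{j=1}^{n} b_{ij}\, p^*(y_i \mid x_j, \theta)
\tag{26}
\]

where the base classifier is here taken to be a logistic-regression classifier with labels
$y \in \{-1, 1\}$, i.e., $p^*(y_i \mid x_j, \theta) = [1 + \exp(-y_i \theta^T x_j)]^{-1}$. By definition [15], the Fisher
information matrix (FIM) for the data likelihood is

\[
\mathrm{FIM}\{ p(\{y_i, i \in \mathcal{L}\} \mid \{\mathcal{N}_t(x_i) : i \in \mathcal{L}\}, \theta) \}
= E_{\{y_i\}} \Big[ \frac{\partial \ell}{\partial \theta} \frac{\partial \ell}{\partial \theta}^T \Big]
= \sum_{i \in \mathcal{L}} \frac{z_i z_i^T}{\gamma_i}
\tag{27}
\]

where

\[
z_i \stackrel{\mathrm{def}}{=} \sum_{j=1}^{n} b_{ij}\, p^*(y_j = 1 \mid x_j, \theta)\, p^*(y_j = -1 \mid x_j, \theta)\, x_j
\tag{28}
\]

\[
\gamma_i \stackrel{\mathrm{def}}{=} \sum_{k=1}^{n} b_{ik}\, p^*(y_k = 1 \mid x_k, \theta)\; \sum_{k=1}^{n} b_{ik}\, p^*(y_k = -1 \mid x_k, \theta)
\tag{29}
\]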


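Assume that $x_* \in \{x_1, \ldots, x_n\} \setminus \mathcal{L}$ is newly labeled and added to the labeled
set, so that $\mathcal{L}$ changes to $\mathcal{L}' = \mathcal{L} \cup \{*\}$. The Fisher information matrix changes to

\[
\mathrm{FIM}\{ p(\{y_i, i \in \mathcal{L}'\} \mid \{\mathcal{N}_t(x_i) : i \in \mathcal{L}'\}, \theta) \}
= \sum_{i \in \mathcal{L}} \frac{z_i z_i^T}{\gamma_i} + \frac{z_* z_*^T}{\gamma_*}
\tag{30}
\]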


The ratio of the determinants of the new and old FIMs (one of several possible measures [14],
related to the entropy under a Gaussian approximation of the posterior of the model
parameters) is

\[
\frac{\det\Big( \sum_{i \in \mathcal{L}} \frac{z_i z_i^T}{\gamma_i} + \frac{z_* z_*^T}{\gamma_*} \Big)}
{\det\Big( \sum_{i \in \mathcal{L}} \frac{z_i z_i^T}{\gamma_i} \Big)}
= 1 + \frac{1}{\gamma_*}\, z_*^T \Big[ \sum_{i \in \mathcal{L}} \frac{z_i z_i^T}{\gamma_i} \Big]^{-1} z_*
\tag{31}
\]

The logarithmic difference is

\[
\psi(x_*) = \ln \Big\{ 1 + \frac{1}{\gamma_*}\, z_*^T \Big[ \sum_{i \in \mathcal{L}} \frac{z_i z_i^T}{\gamma_i} \Big]^{-1} z_* \Big\}
\tag{32}
\]

which we employ as our selection criterion in identifying the most informative data
sample for labeling. The criterion $\psi(x_*)$ is calculated for all $x_* \in \{x_1, \ldots, x_n\} \setminus \mathcal{L}$,
and the one with the maximum value is the most informative data location at which to obtain
a label. The true value of $\theta$ required in calculating $z$ and $\gamma$ is replaced with the most recent
update of the parameters, following the strategy taken in [17], [14]. To the best of our
knowledge, this is the first use of active learning in an MTL setting, and we are also
considering a semi-supervised model.
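
The selection rule can be sketched as follows, assuming the vectors $z_i$ and scalars $\gamma_i$ have already been computed from the current parameter estimate. The function names are ours, and the sketch further assumes that the information matrix of the labeled set is invertible (enough labeled samples with linearly independent $z_i$); it uses a small hand-written linear solver to stay within the standard library.

```python
import math


def solve_linear(matrix, rhs):
    """Solve matrix @ u = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("information matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    u = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * u[k] for k in range(row + 1, n))
        u[row] = (aug[row][n] - tail) / aug[row][row]
    return u


def information_matrix(z_labeled, gamma_labeled):
    """Sum of z_i z_i^T / gamma_i over the labeled set."""
    d = len(z_labeled[0])
    fim = [[0.0] * d for _ in range(d)]
    for z, g in zip(z_labeled, gamma_labeled):
        for a in range(d):
            for b in range(d):
                fim[a][b] += z[a] * z[b] / g
    return fim


def selection_criterion(fim, z_star, gamma_star):
    """psi(x_*) = ln(1 + z_*^T FIM^{-1} z_* / gamma_*)."""
    u = solve_linear(fim, z_star)
    return math.log(1.0 + sum(a * b for a, b in zip(z_star, u)) / gamma_star)


def most_informative(fim, candidates):
    """candidates: list of (z_star, gamma_star); index of the largest psi."""
    scores = [selection_criterion(fim, z, g) for z, g in candidates]
    return max(range(len(scores)), key=scores.__getitem__)
```

After a label is acquired, the information matrix would be updated by adding $z_* z_*^T / \gamma_*$ and the criterion recomputed for the remaining unlabeled points.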


RESULTS 


To evaluate the proposed multi-task feature learning algorithm, experimental results
are presented on four data sets, one based on synthetic data and the others on benchmark
real data. In order to compare with other algorithms, we employ AUC as the
performance measure, where AUC stands for the area under the receiver operating
characteristic (ROC) curve [18]. The relation between the AUC and the error rate is
discussed in [19].
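
For reference, the AUC can be computed directly as the fraction of (positive, negative) pairs that the classifier ranks correctly, with ties counted as one half. The sketch below is a generic illustration with hypothetical names; it is quadratic in the number of samples and is not the evaluation code used for the reported results.

```python
def auc_score(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    labels are 1 for the positive class and 0 for the negative class.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0          # correctly ranked pair
            elif p == n:
                wins += 0.5          # tie counts as half
    return wins / (len(positives) * len(negatives))
```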

Throughout this section, the basic setup of the prior hyper-parameters is as follows: $\beta_0 = 1$,
$\tau_{10} = 0.05$, $\tau_{20} = 0.05$, $a = 1$, $b = 1$, and $W_m$ are set according to sample means. The
truncation level for the Dirichlet process is the number of tasks, and the initial number of
latent features is the dimension of the features.


We first demonstrate the proposed multi-task feature learning model on synthetic
data, for illustrative purposes. For our synthetic example, we have six tasks $\{x_m, y_m\}$,
$m = 1, 2, \ldots, 6$. The features $x_m$ for each task are generated from a Gaussian distribution
with zero mean and identity covariance matrix. The dimension of each datum is
50 and we generated $N = 500$ samples for each task. The label $y_m(i)$, $i = 1, 2, \ldots, 500$,
is generated from $y_m(i) = \mathrm{sign}(\theta_m^T x_m(i) + \mathcal{N}_m(i))$, where $\mathcal{N}_m$ is additive white
Gaussian noise with a signal-to-noise ratio of 10. For tasks 1, 2 and 3, the features with indices
{3, 7, 12, 14, 16, 31, 33, 47, 48, 50} are relevant for classification, which means that the linear
classifiers $\theta_m$, $m = 1, 2, 3$, are very sparse, with only elements {3, 7, 12, 14, 16, 31, 33, 47, 48, 50}
non-zero. For tasks 4, 5 and 6, the features with indices {11, 14, 22, 28, 30, 32, 33, 38, 39, 40} are
useful for obtaining the correct classifiers. The truncation level of the Dirichlet process for
the synthesized data is equal to six. Since ground truth is available for this synthesized
example, it is employed as a starting point for analysis of the models.
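
A data set of this kind can be generated with a few lines of code. The sketch below makes several assumptions that the text does not fix: the non-zero classifier weights are drawn from a standard normal distribution, each task gets its own weights on the shared support, and the signal-to-noise ratio of 10 is interpreted on a linear (power) scale. All names are hypothetical.

```python
import math
import random

# One-based indices of the relevant features, as listed in the text.
RELEVANT_TASKS_1_TO_3 = [3, 7, 12, 14, 16, 31, 33, 47, 48, 50]
RELEVANT_TASKS_4_TO_6 = [11, 14, 22, 28, 30, 32, 33, 38, 39, 40]


def make_task(relevant, rng, dim=50, n_samples=500, snr=10.0):
    """Return (theta, xs, ys) for one synthetic task."""
    theta = [0.0] * dim
    for idx in relevant:
        theta[idx - 1] = rng.gauss(0.0, 1.0)     # assumed weight distribution
    # With x ~ N(0, I) the signal power is ||theta||^2; assumed linear SNR.
    noise_std = math.sqrt(sum(w * w for w in theta) / snr)
    xs, ys = [], []
    for _ in range(n_samples):
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        score = sum(w * v for w, v in zip(theta, x)) + rng.gauss(0.0, noise_std)
        xs.append(x)
        ys.append(1 if score >= 0 else -1)
    return theta, xs, ys


def make_six_tasks(seed=0):
    """Tasks 1-3 share one support, tasks 4-6 share the other."""
    rng = random.Random(seed)
    return [make_task(RELEVANT_TASKS_1_TO_3 if m < 3 else RELEVANT_TASKS_4_TO_6, rng)
            for m in range(6)]
```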

The ground truth of the classifier coefficients $\theta_m$, $m = 1, 2, \ldots, 6$, as well as the classifier
coefficients learnt with the proposed semi-supervised feature learning algorithm, are
depicted in Figure 1. In the Gibbs sampling implementation, the burn-in period is
1000 iterations and the results are the average of 500 iterations after the burn-in period.
The results of the variational Bayesian approach are the average of 100 random trials.

From  Figure  1,  we  can  see  that  the  proposed  semi-supervised  MTL  feature  learning 
algorithm  can  correctly  select  useful  features  for  each  task  and  can  also  obtain  a  very 
good  approximation  for  those  weights. 

Each curve in Figure 2(a) represents the mean AUC as a function of the number of
labeled data. Here we compare our proposed algorithm to the semi-supervised MTL of
[10]. Figure 2(b) shows the sharing patterns that the semi-supervised MTL feature learning
algorithm finds for the six tasks, plotted as a Hinton diagram [20] of the between-task
sharing matrix. The (i, j)-th element of the sharing
matrix records the number of occurrences in which tasks i and j are grouped into the same
cluster. The Hinton diagram in Figure 2(b) also shows that the sharing
mechanism of the semi-supervised MTL agrees with the similarity between the tasks.
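
The between-task sharing matrix described above can be accumulated from a collection of cluster assignments. The sketch assumes that each element of `assignment_samples` is one vector of cluster indicators (one entry per task), for example one per retained sampler iteration or one per trial; the function name is hypothetical.

```python
def sharing_matrix(assignment_samples, n_tasks):
    """Count how often each pair of tasks falls in the same cluster."""
    counts = [[0] * n_tasks for _ in range(n_tasks)]
    for eta in assignment_samples:
        for i in range(n_tasks):
            for j in range(n_tasks):
                if eta[i] == eta[j]:
                    counts[i][j] += 1
    return counts
```

The resulting matrix is symmetric, and its diagonal equals the number of assignment vectors supplied.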

From Figure 1 and Figure 2, three observations can be made:

• With the feature selection mechanism, the proposed algorithm may help to improve
classification performance;

• Tasks 1, 2 and 3 are grouped together and tasks 4, 5 and 6 are grouped together,
which is in agreement with the ground truth;

• The number and positions of the zero elements for tasks 1, 2 and 3 are almost the same,
although the non-zero elements have different values; likewise, the number and positions of
the zero elements for tasks 4, 5 and 6 are almost the same, although the non-zero elements
have different values.

In the second example we consider the underwater-mine classification problem studied
in [21], where the acoustic imagery data were collected with four different imaging
sonars from two different environments (see [21] for details)¹. This is a binary classification
problem aiming to separate mines from non-mines based on the synthetic-aperture
sonar (SAS) imagery. For each sonar image, a detector automatically finds the objects
of interest, and a 13-dimensional feature vector is extracted for each target. The number
of mines in each of the eight tasks varies from 9 to 65, and each task contains from
10 to 100 times more non-mines (clutter) than mines.

¹ The data for the underwater mine example are available at www.ece.duke.edu/~lcarin/UnderwaterMines.zip

Fig. 1. Classifier coefficients of the semi-supervised MTL feature learning algorithm on Tasks 1-6. The horizontal
axis is the feature index in each task. The vertical axis is the classifier coefficient. The coefficients for the synthesized data
are denoted by blue circles; the coefficients learnt using the Gibbs sampler are denoted by green stars; the
coefficients learnt with variational Bayes (VB) are denoted by red diamonds.

In this problem, there are a total of 8 data sets, constituting 8 tasks. The 8 data sets
are collected with four sonar sensors from two different environmental conditions. The
total number of data points in each task is listed in Table I. The distribution of sensors
and environments is listed in Table II.

Fig. 2. (a) Performance of the semi-supervised MTL feature learning algorithm on Tasks 1-6. The horizontal axis
is the number of labeled data in each task. The vertical axis is the AUC averaged over the six tasks; (b) The Hinton
diagram of between-task sharing found by semi-supervised MTL feature learning.

TABLE I

Number of data points in each task for the underwater-mine data set considered in Figure ??.

Task ID          1     2     3     4     5     6     7     8
Number of data   1813  3562  1312  1499  2853  1162  1134  756

It can be seen from the synthesized example that the variational Bayesian (VB) approximation
achieves almost the same performance as the Gibbs sampler, but at much
higher speed. In this subsection we therefore present only the results of the VB approach. To show the
strength of the proposed algorithm, we add 7 dummy features to the original feature
vector, extending the 13-dimensional feature vector to a 20-dimensional feature vector.
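
The augmentation step above can be sketched as follows. The report does not say how the dummy features were generated, so this sketch assumes independent standard-normal noise; the function name `add_dummy_features` and the example vectors are hypothetical and serve only as illustration.

```python
import random


def add_dummy_features(features, n_dummy=7, seed=0):
    """Append n_dummy noise columns to every feature vector.

    Assumption: the dummy features are i.i.d. standard-normal noise.
    The report does not specify how they were actually generated.
    """
    rng = random.Random(seed)
    return [list(x) + [rng.gauss(0.0, 1.0) for _ in range(n_dummy)]
            for x in features]


# Two hypothetical 13-dimensional feature vectors (not real sonar data).
original = [[0.1 * j for j in range(13)], [1.0] * 13]
augmented = add_dummy_features(original)

print(len(original[0]), "->", len(augmented[0]))  # 13 -> 20
```

A feature-selection method that works as intended would be expected to assign negligible weight to the 7 appended columns, since by construction they carry no information about the label.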

Following [21], we perform 50 independent trials. In each trial we randomly
select a subset of data for which labels are assumed known (labeled data), train the
semi-supervised MTL with feature selection and without feature selection [10], and test
the classifiers on the remaining data. The AUC averaged over the 8 tasks is presented in
Figure 3 as a function of the number of labeled data in each task, where each curve
represents the mean calculated from the 50 independent trials. Figure 4 shows one
sample of the weights for the 8 tasks when each task has 30 labeled data.
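
A minimal sketch of this evaluation protocol is given below, assuming the AUC is computed with the standard rank-sum (Mann-Whitney) statistic. The classifier itself is not reproduced: `train_and_score` is a hypothetical stand-in that receives the labeled data and the unlabeled test features and returns one score per test point. The toy tasks are synthetic and are not the sonar data.

```python
import random


def auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) statistic."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    # Count positive/negative pairs ranked correctly; ties count one half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_auc_over_trials(tasks, n_labeled, n_trials, train_and_score, seed=0):
    """Average the task-averaged test AUC over independent random trials."""
    rng = random.Random(seed)
    trial_means = []
    for _ in range(n_trials):
        task_aucs = []
        for X, y in tasks:
            idx = list(range(len(y)))
            rng.shuffle(idx)  # random choice of the labeled subset
            lab, test = idx[:n_labeled], idx[n_labeled:]
            scores = train_and_score([X[i] for i in lab],
                                     [y[i] for i in lab],
                                     [X[i] for i in test])
            task_aucs.append(auc(scores, [y[i] for i in test]))
        trial_means.append(sum(task_aucs) / len(task_aucs))
    return sum(trial_means) / len(trial_means)


def make_toy_task(rng, n=40, n_pos=10):
    """Synthetic, perfectly separable one-feature task (illustration only)."""
    y = [1] * n_pos + [0] * (n - n_pos)
    X = [[label + rng.uniform(-0.3, 0.3)] for label in y]
    return X, y


toy_rng = random.Random(1)
toy_tasks = [make_toy_task(toy_rng) for _ in range(2)]


def first_feature_scorer(X_lab, y_lab, X_test):
    """Hypothetical stand-in classifier: score each point by its first feature."""
    return [x[0] for x in X_test]


result = mean_auc_over_trials(toy_tasks, n_labeled=5, n_trials=3,
                              train_and_score=first_feature_scorer)
print(result)
```

Passing the test features to `train_and_score` reflects the semi-supervised setting, in which the unlabeled data are available during training.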

The results on underwater target classification also show that the proposed
semi-supervised MTL feature learning outperforms the semi-supervised MTL algorithm [10]
by selecting relevant features.

TABLE II

Distribution of sensors and environments over the 8 tasks. Environment A is relatively
challenging while Environment B is relatively benign, with these characteristics manifested
by the details of the sea bottom.

Task  Sonar  Environment
1     1      B
2     1      A
3     2      B
4     3      B
5     4      B
6     2      A
7     3      A
8     4      A


Fig. 3. Performance of the semi-supervised MTL feature learning algorithm on underwater target classification, in
comparison to the semi-supervised MTL algorithm [10]. The horizontal axis is the number of labeled data in each task.
The vertical axis is the AUC averaged over the eight tasks and 50 independent trials.

Fig. 4. One sample of the weights when each task has 30 labeled data.


To demonstrate active learning integrated with multi-task semi-supervised learning,
we consider a remote sensing problem based on data collected from real landmine
fields². In this problem there are a total of 19 sets of data, collected from various
landmine fields (with inert landmine simulants). Each data point is represented by a
9-dimensional feature vector extracted from synthetic aperture radar images. Since this
is a detection problem, the class labels are binary, with 1 indicating landmine and 0
indicating clutter (false alarm). We have also demonstrated this technology with MCM
data, as discussed above.

² The data from the landmine example are available at www.ece.duke.edu/~lcarin/LandmineData.zip

As opposed to the setting in [10], where it is assumed that labeled data from the 19
data sets are available simultaneously, we here assume the much more realistic case in
which labeled data are acquired sequentially, one data set (task) at a time. Once
the process of label acquisition in a given environment is completed, that environment
is not revisited to acquire new labeled data.

Each of the 19 data sets defines a task, in which we aim to find landmines with a
minimum number of false alarms. Of the 19 data sets, 1-10 were collected at foliated
regions and 11-19 were collected at regions that are bare earth or desert. We expect fewer
new labeled data to be needed when considering a new task for which environmental conditions
are unchanged from previous tasks (but this is inferred by the algorithm, and not imposed
by the user).

In the experiment both labeled and unlabeled data are used in training the algorithm.
After training, the algorithm is tested on the unlabeled data to calculate the area under
the ROC curve (AUC) for each data set. We compare the active-learning results with AUC
results obtained using random selection of labeled data. For the case where the labeled
data are randomly selected, we perform 20 independent trials, and compute the mean as
well as error bars of the AUC from the trials. Since the data sets are acquired sequentially,
the results are presented as AUC as a function of the number of tasks from which labeled
data are acquired (the ordering of the tasks is arbitrary; the task order considered here
was selected so as to make a point about the number of labels actively acquired, as discussed
further below).
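
The summary over random-selection trials can be sketched as below. The report does not define its error bars, so this sketch assumes one sample standard deviation around the mean; the function name `summarize_auc` is hypothetical, and the AUC values are placeholders for illustration, not results from the experiment.

```python
import statistics


def summarize_auc(auc_by_num_tasks):
    """Map each task count to (mean AUC, sample standard deviation) over trials.

    Assumption: error bars are one sample standard deviation; the report
    does not state which definition was used.
    """
    summary = {}
    for num_tasks, aucs in sorted(auc_by_num_tasks.items()):
        summary[num_tasks] = (statistics.mean(aucs), statistics.stdev(aucs))
    return summary


# Placeholder AUC values for illustration only (not results from the report).
placeholder = {1: [0.50, 0.70], 2: [0.60, 0.80]}
summary = summarize_auc(placeholder)

for num_tasks, (mean, err) in summary.items():
    print(f"tasks={num_tasks}  mean AUC={mean:.3f}  error bar={err:.3f}")
```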

We  observe  from  the  results  in  Figure  5  that  active  learning  performs  much  better 
than  random  selection  for  a  small  number  of  data  sets  (tasks).  As  discussed  below, 
the  total  number  of  labels  used  in  random  selection  of  labels  is  the  same  as  that  used 
for  active  learning.  When  the  number  of  tasks  increases,  the  benefit  of  active  learning 
diminishes  since  the  scarcity  of  labeled  data  is  overcome  via  multi-task  learning. 

Fig. 5. Performance of active learning for the semi-supervised MTL algorithm in comparison to semi-supervised
MTL with randomly selected labeled data. The horizontal axis is the number of tasks from which labeled data are
acquired. The vertical axis is the AUC averaged over the 19 tasks.


In Figure 6 we plot the number of labeled data for each task, as a function of task
index. For the active-learning algorithm the total number of labeled data is n = 174
across all 19 tasks (this is determined adaptively by the proposed algorithm). For the
random selection of labeled data, the data from all 19 tasks are pooled, and 174
samples are selected at random for labeling; therefore, the number of labels acquired
per task is not constant (the data in Figure 6 for random selection represent one
example). For the active-learning results in Figure 6, note the large jump in the number
of labeled data at task k = 11. Recall from above that data sets 1-10 are from generally
foliated regions and data sets 11-19 are from regions that are generally bare earth or
desert. Therefore, the jump in Figure 6 at k = 11 is consistent with expectations based
on the properties of the environments.
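
The random-selection baseline described above can be sketched as follows: all tasks are pooled, 174 points are drawn without replacement, and the per-task label counts are whatever that single draw produces. The task sizes used here are hypothetical, since the report does not list them for the landmine data, and the name `random_label_budget` is introduced only for illustration.

```python
import random
from collections import Counter


def random_label_budget(task_sizes, n_labels, seed=0):
    """Pool all tasks, draw n_labels points without replacement,
    and return the number of labels that fall in each task."""
    # One pool entry per data point, tagged with its task index (1-based).
    pool = [task for task, size in enumerate(task_sizes, start=1)
            for _ in range(size)]
    rng = random.Random(seed)
    counts = Counter(rng.sample(pool, n_labels))
    return [counts.get(task, 0) for task in range(1, len(task_sizes) + 1)]


# Hypothetical sizes for the 19 tasks (assumption, not from the report).
task_sizes = [400 + 10 * k for k in range(19)]
per_task = random_label_budget(task_sizes, n_labels=174)

print(per_task)
print(sum(per_task))  # 174
```

Because the draw is made over the pooled data, larger tasks tend to receive more labels, and a different seed gives a different per-task allocation, which is why the random-selection curve in Figure 6 is described as one example.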

IMPACT/APPLICATIONS 

The technology is of significant utility for MCM and ASW applications. We have
developed a new classification algorithm for multi-task feature learning. By utilizing
the clustering property of the Dirichlet process (DP) and the feature-selection property of
the Beta process, our algorithm can learn classifiers jointly, sharing information as
appropriate; at the same time, the algorithm can automatically exclude irrelevant features
and learn good weights for the relevant features of each task.

TRANSITIONS 

The  technology  is  being  transitioned  from  SIG  to  the  Navy,  via  a  collaboration  with 
Dr.  Robert  McDonald,  of  NSWC  Panama  City. 

RELATED  PROJECTS 

SIG  is  executing  a  related  SBIR  on  active  learning. 


Fig. 6. Number of labeled data using active learning in comparison to number of labeled data with random selection;
for the latter, this is one random example.


PUBLICATIONS


Q. Liu, X. Liao, H. Li, J. Stack and L. Carin, “Semi-supervised multitask learning,”
IEEE Trans. Pattern Analysis Machine Intelligence, vol. 31, pp. 1074-1086, June 2009.

J. R. Stack, G. Dobeck, X. Liao and L. Carin, “Kernel-Matching Pursuits With Arbitrary
Loss Functions,” IEEE Transactions on Neural Networks, vol. 20, no. 3, pp.
395-405, March 2009.

H.  Li,  X.  Liao  and  L.  Carin,  “Active  learning  for  semi-supervised  multi-task  learning,” 
in  Proc.  IEEE  International  Conference  on  Acoustics,  Speech  and  Signal  Processing 
(ICASSP),  2009. 

Also  had  a  JUA  paper  published  in  2009,  details  omitted. 

PATENTS 

None. 

HONORS 

None.